 activation sparsity


Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity

Neural Information Processing Systems

Accelerating large language model (LLM) inference is critical for real-world deployments requiring high throughput and low latency. Contextual sparsity, where each token dynamically activates only a small subset of the model parameters, shows promise but does not scale to large batch sizes because the union of active neurons across the batch quickly approaches dense computation. We introduce Polar Sparsity, highlighting a key shift in sparsity importance from MLP to attention layers as batch size and sequence length grow. While MLP layers become more compute-efficient under batching, their sparsity vanishes. In contrast, attention becomes increasingly expensive at scale, while its head sparsity remains stable and batch-invariant. We develop Selective Head Attention with hardware-efficient, sparsity-aware GPU kernels, delivering up to 2.2× end-to-end speedups for models like OPT, LLaMA2 & 3, Qwen, and Mistral across various batch sizes and sequence lengths without compromising accuracy. To our knowledge, this is the first work to demonstrate that contextual sparsity can scale effectively to large batch sizes, delivering substantial inference acceleration with minimal changes and making Polar Sparsity practical for large-scale, high-throughput LLM deployment systems.
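To illustrate why the union of active neurons approaches dense computation under batching, here is a minimal sketch. It assumes, purely for illustration, that each token activates each neuron independently with the same probability `p`; real activations are correlated, so this is not the paper's analysis, and the 10% figure and the function name `expected_union_fraction` are hypothetical.

```python
def expected_union_fraction(p: float, batch_size: int) -> float:
    """Expected fraction of neurons active for at least one token in a batch.

    Assumption (not from the abstract): every token activates each neuron
    independently with probability p. A neuron is skipped only if no token
    in the batch activates it, which happens with probability (1 - p) ** B.
    """
    return 1.0 - (1.0 - p) ** batch_size


if __name__ == "__main__":
    # Illustrative per-token activation rate of 10%.
    for batch_size in (1, 8, 64, 256):
        frac = expected_union_fraction(0.10, batch_size)
        print(f"batch={batch_size:4d}  union fraction={frac:.4f}")
```

Under this simplified model the per-token sparsity is unchanged, yet the fraction of neurons that must be computed for the whole batch rises quickly with batch size, which is the effect the abstract attributes to MLP layers.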


Spark Transformer: Reactivating Sparsity in Transformer FFN and Attention

Neural Information Processing Systems

The discovery of the lazy neuron phenomenon (Li et al., 2022), where fewer than 10% of the feedforward network (FFN) parameters in trained Transformers are activated per token, has spurred significant interest in activation sparsity for enhancing large model efficiency. While notable progress has been made in translating such sparsity to wall-time benefits across CPUs, GPUs, and TPUs, modern Transformers have moved away from the ReLU activation function crucial to this phenomenon. Existing efforts on re-introducing activation sparsity, e.g., by reverting to ReLU, applying top-k masking, or adding a sparse predictor, often degrade model quality, increase parameter count, or complicate training.
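As a rough illustration of the generic top-k masking idea mentioned above, the following sketch keeps only the k largest-magnitude activations of one token's FFN hidden vector and zeroes the rest. It is a plain-Python toy, not the Spark Transformer method; the function name `top_k_mask` and the sample values are hypothetical.

```python
from typing import List


def top_k_mask(activations: List[float], k: int) -> List[float]:
    """Zero all but the k largest-magnitude entries of an activation vector."""
    k = max(k, 0)  # guard against negative k, which would mis-slice below
    # Indices sorted by descending magnitude; the first k are kept.
    order = sorted(range(len(activations)),
                   key=lambda i: abs(activations[i]),
                   reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(activations)]


if __name__ == "__main__":
    hidden = [0.2, -1.5, 0.7, 0.05]
    print(top_k_mask(hidden, 2))
```

Only the rows of the second FFN matrix that correspond to nonzero entries need to be read, which is where the potential savings come from; the sorting step itself adds cost, one reason such masking is not free in practice.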


Learn To be Efficient: Build Structured Sparsity in Large Language Models

Neural Information Processing Systems

Large Language Models (LLMs) have achieved remarkable success with their billion-level parameters, yet they incur high inference overheads. The emergence of activation sparsity in LLMs provides a natural approach to reduce this cost by involving only part of the parameters in inference. However, existing methods focus only on exploiting this naturally formed activation sparsity in a post-training setting, overlooking the potential for further amplifying it. In this paper, we hypothesize that LLMs can learn to be efficient by achieving more structured activation sparsity. To achieve this, we introduce a novel training algorithm, Learn-To-be-Efficient (LTE), designed to train efficiency-aware LLMs to activate fewer neurons and achieve a better trade-off between sparsity and performance. Furthermore, unlike SOTA MoEfication methods, which mainly focus on ReLU-based models, LTE can also be applied to LLMs like LLaMA that use non-ReLU activations. Extensive evaluation on language understanding, language generation, and instruction tuning tasks shows that LTE consistently outperforms SOTA baselines. Along with our hardware-aware custom kernel implementation, LTE reduces LLaMA2-7B inference latency by 25% at 50% sparsity.


Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion

Neural Information Processing Systems

Transformer models can face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by converting parts of the network into equivalent Mixture-of-Experts (MoE) layers. Despite the crucial role played by activation sparsity, its impact on this process remains unexplored. We demonstrate that the efficiency of the conversion can be significantly enhanced by a proper regularization of the activation sparsity of the base model. Moreover, motivated by the high variance of the number of activated neurons for different inputs, we introduce a more effective dynamic-k expert selection rule that adjusts the number of executed experts on a per-token basis. To achieve further savings, we extend this approach to multi-head attention projections. Finally, we develop an efficient implementation that translates these computational savings into actual wall-clock speedup. The proposed method, Dense to Dynamic-k Mixture-of-Experts (D2DMoE), outperforms existing approaches on common NLP and vision tasks, reducing inference cost by up to 60% without significantly impacting performance.
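One way a per-token dynamic-k selection rule could look is sketched below. The abstract does not spell out the rule, so the thresholding scheme here (run every expert whose router score is at least a fraction `tau` of the token's highest score) is an assumption for illustration, and the names `select_experts`, `router_scores`, and `tau` are hypothetical.

```python
from typing import List


def select_experts(router_scores: List[float], tau: float) -> List[int]:
    """Return indices of experts to execute for a single token.

    Assumed rule: keep experts whose score is >= tau * (max score).
    The number of selected experts therefore varies from token to token,
    unlike a fixed top-k rule.
    """
    if not router_scores:
        return []
    threshold = tau * max(router_scores)
    return [i for i, s in enumerate(router_scores) if s >= threshold]


if __name__ == "__main__":
    # A token dominated by one or two experts runs few of them...
    print(select_experts([0.9, 0.1, 0.5, 0.05], tau=0.5))
    # ...while a token with evenly spread scores runs many.
    print(select_experts([1.0, 0.9, 0.95, 0.8], tau=0.5))
```

With a fixed k, both tokens would execute the same number of experts; a relative threshold lets tokens with concentrated scores spend less compute, which matches the motivation of high variance in activated neurons across inputs.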